Text-based concordance

Sketchboard

  1. a tested querybuilder
  2. find all occurrences

ROUTE A

  1. take their txt elements
  2. parse these using regex to highlight the term/pattern
    • what about punctuation etc.?
    • does the txt element contain XML tags (for instance, quotes)?
      • is that necessary?
  3. make these into lines
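
A rough sketch of Route A, step 2, assuming the txt element is available as a plain string. The helper names `strip_tags` and `highlight`, the `**` marker, and the `<q>` tag in the sample are all hypothetical, and the tag removal is deliberately crude (it assumes simple inline tags and is no substitute for a real XML parser).

```python
import re

def strip_tags(text):
    """Remove simple inline tags such as <q>...</q> (crude, not a real XML parser)."""
    return re.sub(r'<[^>]+>', '', text)

def highlight(text, term, marker='**'):
    """Wrap whole-word matches of `term` in `marker`."""
    # re.escape keeps punctuation in the term from being read as regex syntax;
    # \b keeps 'an' from matching inside 'important'.
    pattern = re.compile(r'\b' + re.escape(term) + r'\b')
    return pattern.sub(lambda m: marker + m.group(0) + marker, text)

line = strip_tags('Here is <q>an important</q> string; about an important topic.')
print(highlight(line, 'an'))
# Here is **an** important string; about **an** important topic.
```

Because `\b` only looks at word characters, punctuation next to a hit ('string;') does not get in the way here. A term that itself starts or ends with punctuation would need a different boundary rule.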

ROUTE B

  1. use the raw text
  2. use character offsets

In [1]:
class Concordance:
    pass

In [2]:
class Corpus:
    def build_concordance(self, terms, search_type):
        # TODO: build and return a Concordance for the given terms
        pass

Or use NLTK's built-in functions


In [3]:
import re

test = 'Here is an important string; about an important topic.'
# \b matches a word boundary, so this finds 'an' at the start of a word
results = re.finditer(r'\ban', test)

In [4]:
def print_hits(results):
    for hit in results:
        start, end = hit.start(), hit.end()
        print(test[start:end])

print_hits(results)


an
an

In [5]:
def print_concordance(test, results):
    raw_text = test.split()
    print(raw_text)

print_concordance(test, results)


['Here', 'is', 'an', 'important', 'string;', 'about', 'an', 'important', 'topic.']

In [6]:
def print_concordance(test, term):
    # FIXME: how would this handle phrases?
    # Exact comparison also misses tokens with attached punctuation ('string;').
    raw_text = test.split()
    indices = [i for i, word in enumerate(raw_text) if word == term]
    print(indices)

print_concordance(test, 'an')


[2, 6]
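
One possible answer to the FIXME about phrases, sketched on the same whitespace tokens: compare a sliding window of tokens against the tokenized phrase. The function name `find_phrase` and the choice to strip punctuation and lowercase are assumptions made for illustration, not something the notebook settles.

```python
import string

def find_phrase(text, phrase):
    """Return the token indices at which `phrase` starts in `text`."""
    def clean(word):
        # Drop surrounding punctuation and case so 'string;' matches 'string'.
        return word.strip(string.punctuation).lower()

    tokens = [clean(w) for w in text.split()]
    target = [clean(w) for w in phrase.split()]
    n = len(target)
    if n == 0:
        return []
    # Slide a window of n tokens over the text and compare it to the phrase.
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]

test = 'Here is an important string; about an important topic.'
print(find_phrase(test, 'an'))                # [2, 6]
print(find_phrase(test, 'an important'))      # [2, 6]
print(find_phrase(test, 'important string'))  # [3]
```

The indices are token positions, as in the cell above, so they cannot be used directly as character offsets into the raw text.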

In [7]:
#TODO do this on strings!

indices = [m.start() for m in re.finditer(r'\ban', test)]
print(indices)


[8, 35]

In [8]:
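for i in indices:
    # NOTE: -10 counts from the end of the string and 10 is an absolute
    # position, so both slices come out empty here; the next cell slices
    # relative to i instead.
    print(test[-10:i] + '\t' + 'an' + '\t' + test[i+len('an'):10])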


	an	
	an	

In [9]:
for i in indices:
    left = test[:i]
    right = test[i+len('an'):]
    print(left[-15:] + '\t' + 'an' + '\t' + right[:15])

# TODO: match whole words only?


Here is 	an	 important stri
 string; about 	an	 important topi
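
The character-offset approach can be folded into one function that also covers the whole-word question. This is a minimal sketch, not a settled design: the name `kwic`, the `width` and `whole_word` parameters, and the tuple return value are all assumptions.

```python
import re

def kwic(text, term, width=15, whole_word=True):
    """Return (left, hit, right) context tuples for each match of `term`."""
    pattern = re.escape(term)  # treat the term as literal text, not as a regex
    if whole_word:
        pattern = r'\b' + pattern + r'\b'
    lines = []
    for m in re.finditer(pattern, text):
        # max(0, ...) avoids a negative start, which would count from the end.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

test = 'Here is an important string; about an important topic.'
for left, hit, right in kwic(test, 'an'):
    # Right-align the left context so the hits line up in a column.
    print(left.rjust(15) + '\t' + hit + '\t' + right)
```

Since the term is matched against the raw string, a phrase such as 'an important' works without extra handling. With `whole_word=False` the function also reports the 'an' inside 'important'.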

In [10]:
test[:35]


Out[10]:
'Here is an important string; about '